Abstract

Lossy compression introduces complex compression artifacts,
particularly blocking artifacts, ringing effects and blurring.
Existing algorithms either focus on removing blocking artifacts
and produce blurred output, or restore sharpened images that are
accompanied by ringing effects. Inspired by the deep convolutional
networks (DCN) on super-resolution [4], we formulate a compact
and efficient network for seamless attenuation of different
compression artifacts. We also demonstrate that a deeper
model can be effectively trained with the features learned
in a shallow network. Following a similar “easy to hard”
idea, we systematically investigate several practical transfer
settings and show the effectiveness of transfer learning
in low-level vision problems. Our method shows superior
performance to the state of the art both on the benchmark
datasets and in the real-world use case (i.e. Twitter). In
addition, we show that our method can be applied as
pre-processing to facilitate other low-level vision routines when
they take compressed images as input.

1. Introduction 

Lossy compression (e.g. JPEG, WebP and HEVC-MSP)
is one class of data encoding methods that uses inexact
approximations for representing the encoded content. In
this age of information explosion, lossy compression is
indispensable and inevitable for companies (e.g. Twitter
and Facebook) to save bandwidth and storage space. However,
compression by its nature introduces undesired complex
artifacts, which severely degrade the user experience
(e.g. Figure 1). These artifacts not only decrease
perceptual visual quality, but also adversely affect
various low-level image processing routines that take
compressed images as input, e.g. contrast enhancement [14],
super-resolution [28, 4], and edge detection [2]. Despite
this huge demand, effective compression artifacts
reduction remains an open problem.

(a) Left: the JPEG-compressed image, where we can see blocking artifacts,
ringing effects and blurring on the eyes, and abrupt intensity changes on
the face. Right: the image restored by the proposed deep model (AR-CNN),
where these compression artifacts are removed and sharp details are produced.

(b) Left: the Twitter-compressed image, which is first re-scaled to a small
image and then compressed on the server side. Right: the image restored by
the proposed deep model (AR-CNN).

Figure 1. Example compressed images and our restoration results
on the JPEG compression scheme and the real use case - Twitter.

We take JPEG compression as an example to explain
compression artifacts. The JPEG compression scheme divides
an image into 8x8 pixel blocks and applies the block discrete
cosine transform (DCT) to each block individually.
Quantization is then applied to the DCT coefficients to
save storage space. This step causes a complex combination
of different artifacts, as depicted in Figure 1(a).
Blocking artifacts arise when each block is encoded without
considering the correlation with the adjacent blocks,
resulting in discontinuities at the 8x8 borders. Ringing
effects along the edges occur due to the coarse quantization
of the high-frequency components (also known as the Gibbs
phenomenon [8]). Blurring happens due to the loss of
high-frequency components. To cope with the various
compression artifacts, different approaches have been proposed,
some of which can only deal with certain types of artifacts.
For instance, deblocking oriented approaches [16, 19, 24]
perform filtering along the block boundaries to reduce only
blocking artifacts. Liew et al. [15] and Foi et al. [5]
use thresholding by wavelet transform and Shape-Adaptive
DCT transform, respectively. These approaches are good at
removing blocking and ringing artifacts, but tend to produce
blurred output. Jung et al. [12] propose a restoration method
based on sparse representation. They produce sharpened
images, but these are accompanied by noisy edges and
unnaturally smooth regions.
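The block transform and quantization step described above can be illustrated with a small sketch. It is not the JPEG codec: it is a one-dimensional simplification that applies an orthonormal DCT to independent 8-sample blocks of a single image row and uses one uniform quantization step instead of a JPEG quantization table. The function names, the test ramp and the step size are all assumptions chosen for illustration.

```python
import math

BLOCK = 8  # JPEG block width; this sketch works on one image row only


def dct_1d(x):
    """Orthonormal DCT-II of a short sequence."""
    n = len(x)
    coeffs = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        coeffs.append(scale * sum(
            x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)))
    return coeffs


def idct_1d(coeffs):
    """Inverse of dct_1d (orthonormal DCT-III)."""
    n = len(coeffs)
    samples = []
    for i in range(n):
        samples.append(sum(
            math.sqrt((1.0 if k == 0 else 2.0) / n) * coeffs[k]
            * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for k in range(n)))
    return samples


def quantize(coeffs, step):
    """Uniform quantizer: snap every coefficient to a multiple of `step`."""
    return [step * round(c / step) for c in coeffs]


def compress_row(row, step):
    """Transform, quantize and reconstruct each 8-sample block independently."""
    restored = []
    for start in range(0, len(row), BLOCK):
        block = row[start:start + BLOCK]
        restored.extend(idct_1d(quantize(dct_1d(block), step)))
    return restored


ramp = [2.0 * i for i in range(16)]     # smooth gradient across two blocks
coarse = compress_row(ramp, step=30.0)  # coarse step removes all AC terms
```

With this coarse step, every block of the ramp collapses to a constant value (loss of high-frequency components), and the difference between `coarse[7]` and `coarse[8]` becomes much larger than the original difference between neighbouring pixels, which is the discontinuity at the block border that the text calls a blocking artifact.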

To date, deep learning has shown impressive results on
both high-level and low-level vision problems. In particular,
the SRCNN proposed by Dong et al. [4] shows the great
potential of an end-to-end DCN in image super-resolution.
The study also points out that the conventional
sparse-coding-based image restoration model can be equally
seen as a deep model. However, we find that the three-layer
network is not well suited to restoring compressed images,
especially in dealing with blocking artifacts and handling
smooth regions. As various artifacts are coupled together,
features extracted by the first layer are noisy, causing
undesirable noisy patterns in reconstruction.

To eliminate the undesired artifacts, we improve the SRCNN
by embedding one or more “feature enhancement”
layers after the first layer to clean the noisy features.
Experiments show that the improved model, namely “Artifacts
Reduction Convolutional Neural Networks (AR-CNN)”, is
exceptionally effective in suppressing blocking artifacts while
retaining edge patterns and sharp details (see Figure 1).
However, we encounter difficulties in training
a deeper DCN. “Deeper is better” is widely observed in
high-level vision problems, but not in low-level vision tasks.
Specifically, “deeper is not better” has been pointed out in
super-resolution [3], where training a five-layer network
becomes a bottleneck. The difficulty of training is partially
due to the sub-optimal initialization settings.

The aforementioned difficulty motivates us to investigate
a better way to train a deeper model for low-level vision
problems. We find that this can be effectively solved by
transferring the features learned in a shallow network to
a deeper one and fine-tuning simultaneously¹. This strategy
has also been proven successful in learning a deeper
CNN for image classification [22]. Following a similar general
intuitive idea, easy to hard, we discover other interesting
transfer settings in this low-level vision task: (1) We
transfer the features learned in a high-quality compression
model (easier) to a low-quality one (harder), and find that
it converges faster than random initialization. (2) In the
real use case, companies tend to apply different compression
strategies (including re-scaling) according to their purposes
(e.g. Figure 1(b)). We transfer the features learned
in a standard compression model (easier) to a real use case
(harder), and find that it performs better than learning from
scratch.

¹ Generally, the transfer learning method will train a base network first,
and copy the learned parameters or features of several layers to the
corresponding layers of a target network. These transferred layers can be left
frozen or fine-tuned to the target dataset. The remaining layers are
randomly initialized and trained to the target task.

The contributions of this study are three-fold: (1) We
formulate a new deep convolutional network for efficient
reduction of various compression artifacts. Extensive
experiments, including those on real use cases, demonstrate
the effectiveness of our method over state-of-the-art
methods [5, 11] both perceptually and quantitatively. (2) We
verify that reusing the features in shallow networks is helpful
in learning a deeper model for compression artifact reduction.
Under the same intuitive idea - easy to hard, we reveal
a number of interesting and practical transfer settings. Our
study is the first attempt to show the effectiveness of feature
transfer in a low-level vision problem. (3) We show
the effectiveness of AR-CNN in facilitating other low-level
vision routines (i.e. super-resolution and contrast
enhancement), when they take JPEG images as input.

2. Related work 

Existing algorithms can be classified into deblocking
oriented and restoration oriented methods. The deblocking
oriented methods focus on removing blocking and ringing
artifacts. In the spatial domain, different kinds of
filters [16, 19, 24] have been proposed to adaptively deal with
blocking artifacts in specific regions (e.g., edge, texture,
and smooth regions). In the frequency domain, Liew et
al. [15] utilize wavelet transform and derive thresholds at
different wavelet scales for denoising. The most successful
deblocking oriented method is perhaps the Pointwise
Shape-Adaptive DCT (SA-DCT) [5], which is widely
acknowledged as the state-of-the-art approach [11, 14]. However,
like most deblocking oriented methods, SA-DCT cannot
reproduce sharp edges, and tends to overly smooth texture
regions. The restoration oriented methods regard the
compression operation as distortion and propose restoration
algorithms. They include the projection on convex sets based
method (POCS) [30], solving a MAP problem (FoE) [23], the
sparse-coding-based method [12] and the Regression Tree
Fields based method (RTF) [11], which is the new
state-of-the-art method. The RTF takes the results of SA-DCT [5] as
bases and produces globally consistent image reconstructions
with a regression tree field model. It can also be
optimized for any differentiable loss function (e.g. SSIM),
but often at the cost of other evaluation metrics.

Super-Resolution Convolutional Neural Network (SRCNN) [4]
is closely related to our work. In the study, independent
steps in the sparse-coding-based method are formulated
as different convolutional layers and optimized in
a unified network. It shows the potential of deep models in
low-level vision problems like super-resolution. However,
the model of compression is different from super-resolution
in that it consists of different kinds of artifacts. Designing
a deep model for compression restoration requires a deep
understanding of the different artifacts. We show that
directly applying the SRCNN architecture for compression
restoration will result in undesired noisy patterns in the
reconstructed image.

Figure 2. The framework of the Artifacts Reduction Convolutional Neural Network (AR-CNN). The network consists of four convolutional
layers, each of which is responsible for a specific operation. Then it optimizes the four operations (i.e., feature extraction, feature
enhancement, mapping and reconstruction) jointly in an end-to-end framework. Example feature maps shown in each step illustrate the
functionality of each operation. They are normalized for better visualization.

Transfer learning in deep neural networks has become
popular since the success of deep learning in image
classification [13]. The features learned from ImageNet show
good generalization ability [33] and have become a powerful
tool for several high-level vision problems, such as Pascal
VOC image classification [18] and object detection [6, 20].
Yosinski et al. [32] have also tried to quantify the degree
to which a particular layer is general or specific. Overall,
transfer learning has been systematically investigated
in high-level vision problems, but not in low-level vision
tasks. In this study, we explore several transfer settings on
compression artifacts reduction and show the effectiveness
of transfer learning in low-level vision problems.

3. Methodology 

Our proposed approach is based on the current successful
low-level vision model - SRCNN [4]. To have a better
understanding of our work, we first give a brief overview of
SRCNN. Then we explain the insights that lead to a deeper
network and present our new model. Subsequently, we
explore three types of transfer learning strategies that help in
training a deeper and better network.

3.1. Review of SRCNN 

The SRCNN aims at learning an end-to-end mapping,
which takes the low-resolution image Y (after interpolation)
as input and directly outputs the high-resolution one
F(Y). The network contains three convolutional layers,
each of which is responsible for a specific task. Specifically,
the first layer performs patch extraction and representation,
which extracts overlapping patches from the input
image and represents each patch as a high-dimensional
vector. Then the non-linear mapping layer maps each
high-dimensional vector of the first layer to another
high-dimensional vector, which is conceptually the representation
of a high-resolution patch. At last, the reconstruction
layer aggregates the patch-wise representations to generate
the final output. The network can be expressed as:

$$F_i(Y) = \max(0, W_i * Y + B_i), \quad i \in \{1, 2\}; \tag{1}$$

$$F(Y) = W_3 * F_2(Y) + B_3. \tag{2}$$

where $W_i$ and $B_i$ represent the filters and biases of the $i$-th
layer respectively, $F_i$ is the output feature maps and “$*$”
denotes the convolution operation. The $W_i$ contains $n_i$ filters
of support $n_{i-1} \times f_i \times f_i$, where $f_i$ is the spatial support of
a filter, $n_i$ is the number of filters, and $n_0$ is the number of
channels in the input image. Note that there are no pooling or
fully-connected layers in SRCNN, so the final output F(Y)
is of the same size as the input image. Rectified Linear Unit
(ReLU, max(0, x)) [17] is applied on the filter responses.
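The layer structure of Eqs. (1)-(2) can be sketched in a few lines of plain Python. This is an illustrative sketch only, under several assumptions that are not taken from the paper: feature maps are nested lists, the convolution is implemented as a sliding-window correlation without kernel flipping and without padding (so each layer shrinks the maps by f - 1 pixels, whereas the text assumes an output of the same size as the input), and each layer receives the feature maps of the previous layer. The names `conv2d_valid`, `relu` and `forward` are hypothetical.

```python
def conv2d_valid(maps, filters, biases):
    """Apply n_out filters of shape [C][f][f] to C input maps (no padding)."""
    height, width = len(maps[0]), len(maps[0][0])
    f = len(filters[0][0])
    out = []
    for filt, bias in zip(filters, biases):
        fmap = []
        for r in range(height - f + 1):
            row = []
            for c in range(width - f + 1):
                s = bias
                for ch, kernel in enumerate(filt):
                    for u in range(f):
                        for v in range(f):
                            s += kernel[u][v] * maps[ch][r + u][c + v]
                row.append(s)
            fmap.append(row)
        out.append(fmap)
    return out


def relu(maps):
    """Element-wise max(0, x) on every feature map."""
    return [[[max(0.0, v) for v in row] for row in m] for m in maps]


def forward(image, layers):
    """Run a stack of (filters, biases) layers; ReLU after all but the last."""
    maps = [image]
    for idx, (filters, biases) in enumerate(layers):
        maps = conv2d_valid(maps, filters, biases)
        if idx < len(layers) - 1:
            maps = relu(maps)
    return maps[0]


# Toy three-layer network (3-1-1) on a 4x4 image of ones.
image = [[1.0] * 4 for _ in range(4)]
layers = [
    ([[[[1.0] * 3 for _ in range(3)]],
      [[[-1.0] * 3 for _ in range(3)]]], [0.0, 0.0]),  # 2 filters, 1 channel
    ([[[[0.5]], [[2.0]]]], [0.5]),                     # 1 filter, 2 channels
    ([[[[2.0]]]], [-1.0]),                             # reconstruction layer
]
output = forward(image, layers)
```

The same `forward` function accepts a longer list of layers, so inserting an extra layer after the first one only changes the `layers` argument.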

These three steps are analogous to the basic operations
in the sparse-coding-based super-resolution methods [29],
and this close relationship lays the theoretical foundation for its
successful application in super-resolution. Details can be
found in the paper [4].

3.2. Convolutional Neural Network for Compression Artifacts Reduction

Insights. In sparse-coding-based methods and SRCNN,
the first step - feature extraction - determines what should
be emphasized and restored in the following stages. However,
as various compression artifacts are coupled together,
the extracted features are usually noisy and ambiguous for
accurate mapping. In the experiments of reducing JPEG
compression artifacts (see Section 4.1.2), we find that some
quantization noise coupled with high-frequency details
is further enhanced, bringing unexpected noisy patterns
around sharp edges. Moreover, blocking artifacts in flat
areas are misrecognized as normal edges, causing abrupt
intensity changes in smooth regions. Inspired by the feature
enhancement step in super-resolution [27], we introduce
a feature enhancement layer after the feature extraction
layer in SRCNN to form a new and deeper network
- AR-CNN. This layer maps the “noisy” features to a
relatively “cleaner” feature space, which is equivalent to
denoising the feature maps.

Formulation. The overview of the new network AR-CNN
is shown in Figure 2. The three layers of SRCNN
remain unchanged in the new model. We also use the same
notation as in Section 3.1. To conduct feature enhancement,
we extract new features from the $n_1$ feature maps of
the first layer, and combine them to form another set of
feature maps. This operation $F_{1'}$ can also be formulated as a
convolutional layer:

$$F_{1'}(Y) = \max(0, W_{1'} * F_1(Y) + B_{1'}), \tag{3}$$

where $W_{1'}$ corresponds to $n_{1'}$ filters with size $n_1 \times f_{1'} \times f_{1'}$.
$B_{1'}$ is an $n_{1'}$-dimensional bias vector, and the output
$F_{1'}(Y)$ consists of $n_{1'}$ feature maps. Overall, the AR-CNN
consists of four layers, namely the feature extraction,
feature enhancement, mapping and reconstruction layers.

It is worth noting that AR-CNN is not equal to a deeper
SRCNN that contains more than one non-linear mapping
layer². Rather than imposing more non-linearity in the
mapping stage, AR-CNN improves the mapping accuracy
by enhancing the extracted low-level features. Experimental
results of AR-CNN, SRCNN and deeper SRCNN will be
shown in Section 4.1.2.

3.3. Model Learning 

Given a set of ground truth images $\{X_i\}$ and their
corresponding compressed images $\{Y_i\}$, we use Mean Squared
Error (MSE) as the loss function:

$$L(\Theta) = \frac{1}{n} \sum_{i=1}^{n} \| F(Y_i; \Theta) - X_i \|^2, \tag{4}$$

where $\Theta = \{W_1, W_{1'}, W_2, W_3, B_1, B_{1'}, B_2, B_3\}$, and $n$ is the
number of training samples. The loss is minimized using
stochastic gradient descent with the standard backpropagation.
We adopt a batch-mode learning method with a batch
size of 128.
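As a small worked instance of Eq. (4), the loss can be written as a plain Python function. This is only a sketch: it assumes every network output and every ground truth sub-image has already been flattened into a list of pixel values, and the name `mse_loss` is hypothetical.

```python
def mse_loss(predictions, targets):
    """Eq. (4): mean over the n samples of the squared L2 distance."""
    n = len(predictions)
    total = 0.0
    for pred, target in zip(predictions, targets):
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
    return total / n


# Two flattened two-pixel samples: squared distances are 4 and 25.
loss = mse_loss([[1.0, 2.0], [0.0, 0.0]], [[1.0, 0.0], [3.0, 4.0]])
```

Note that the average is taken over samples, not over pixels, which follows the form of Eq. (4) as written.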

3.4. Easy-Hard Transfer 

Transfer learning in deep models provides an effective
way of initialization. In fact, conventional initialization
strategies (i.e. randomly drawn from Gaussian distributions
with fixed standard deviations [13]) are found not suitable
for training a very deep model, as reported in [9]. To address
this issue, He et al. [9] derive a robust initialization method
for rectifier nonlinearities, and Simonyan et al. [22] propose to
use the pre-trained features on a shallow network for
initialization.

Figure 3. Easy-hard transfer settings. First row: The baseline 4-layer
network trained with dataA-qA. Second row: The 5-layer
AR-CNN targeted at dataA-qA. Third row: The AR-CNN targeted
at dataA-qB. Fourth row: The AR-CNN targeted at Twitter data.
Green boxes indicate the transferred features from the base
network, and gray boxes represent random initialization. The
ellipsoidal bars between weight vectors represent the activation
functions.

In low-level vision problems (e.g. super-resolution), it is
observed that training a network beyond 4 layers would
encounter the problem of convergence, even when a large
number of training images (e.g. ImageNet) are provided [4]. We
also encounter this difficulty during the training process
of AR-CNN. To this end, we systematically investigate several
transfer settings in training a low-level vision network
following an intuitive idea of “easy-hard transfer”. Specifically,
we attempt to reuse the features learned in a relatively
easier task to initialize a deeper or harder network.
Interestingly, the concept “easy-hard transfer” has already been
pointed out in neuro-computation study [7], where the prior
training on an easy discrimination can help learn a second
harder one.

Formally, we define the base (or source) task as A and the
target tasks as $B_i$, $i \in \{1, 2, 3\}$. As shown in Figure 3, the
base network baseA is a four-layer AR-CNN trained on a
large dataset dataA, whose images are compressed using
a standard compression scheme with the compression quality
qA. All layers in baseA are randomly initialized from a
Gaussian distribution. We will transfer one or two layers of
baseA to different target tasks (see Figure 3). Such transfers
can be described as follows.

Transfer shallow to deeper model. As indicated by [3],
a five-layer network is sensitive to the initialization
parameters and learning rate. Thus we transfer the first two layers
of baseA to a five-layer network $targetB_1$. Then we randomly
initialize its remaining layers³ and train all layers
toward the same dataset dataA. This is conceptually similar to
that applied in image classification [22], but this approach
has never been validated in low-level vision problems.

² Adding non-linear mapping layers has been suggested as an extension
of SRCNN in [4].

³ Random initialization of the remaining layers is also applied similarly
for tasks $B_2$ and $B_3$.

(a) High compression quality (quality 20 in Matlab encoder)

(b) Low compression quality (quality 10 in Matlab encoder)

Figure 4. First layer filters of AR-CNN learned under different
JPEG compression qualities.

Transfer high to low quality. Images of low compression
quality contain more complex artifacts. Here we use
the features learned from high compression quality images
as a starting point to help learn more complicated features in
the DCN. Specifically, the first layer of $targetB_2$ is copied
from baseA and trained on images that are compressed with
a lower compression quality qB.

Transfer standard to real use case. We then explore
whether the features learned under a standard compression
scheme can be generalized to other real use cases, which
often contain more complex artifacts due to different levels
of re-scaling and compression. We transfer the first layer of
baseA to the network $targetB_3$, and train all layers on the
new dataset.
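The three transfer settings above share one initialization pattern: copy selected layers from the base network and draw the remaining layers at random. The following sketch illustrates that pattern only. The layer names, the toy weight counts and the helper `init_target` are hypothetical, each layer is represented as a flat list of weights, and the default standard deviation of 0.001 is the value reported for the Gaussian initialization in the experiments. All layers, copied or not, would then be fine-tuned on the target data.

```python
import random


def init_target(base_params, target_sizes, transferred, std=0.001, seed=0):
    """Build initial weights for a target network.

    base_params:  dict mapping layer name -> flat list of trained weights
    target_sizes: dict mapping layer name -> number of weights in the target
    transferred:  set of layer names copied from the base network
    """
    rng = random.Random(seed)
    target = {}
    for name, size in target_sizes.items():
        if name in transferred:
            if len(base_params[name]) != size:
                raise ValueError("shape mismatch in layer " + name)
            target[name] = list(base_params[name])  # copy, do not alias
        else:
            # Remaining layers: zero-mean Gaussian initialization.
            target[name] = [rng.gauss(0.0, std) for _ in range(size)]
    return target


# Toy four-layer base network (weight counts are made up).
base = {"conv1": [0.2, -0.1, 0.05], "conv1p": [0.3, 0.1],
        "conv2": [0.4], "conv3": [0.7, -0.6]}

# Shallow to deeper: a five-layer target reuses the first two layers.
deeper_sizes = {"conv1": 3, "conv1p": 2, "conv1pp": 4, "conv2": 1, "conv3": 2}
target_b1 = init_target(base, deeper_sizes, transferred={"conv1", "conv1p"})

# High to low quality, or standard to real use case: same four-layer
# structure, only the first layer is copied.
same_sizes = {name: len(weights) for name, weights in base.items()}
target_b2 = init_target(base, same_sizes, transferred={"conv1"})
```

Copying the lists rather than sharing them keeps the base network unchanged while the target network is fine-tuned.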

Discussion. Why are the features learned from relatively
easy tasks helpful? First, the features from a well-trained
network can provide a good starting point. Then
the rest of a deeper model can be regarded as a shallow one,
which converges more easily. Second, features learned in
different tasks often have a lot in common. For instance,
Figure 4 shows the features learned under different JPEG
compression qualities. Filters a, b, c of high quality
are very similar to filters a', b', c' of low quality. This
kind of feature can be reused or improved during fine-tuning,
making the convergence faster and more stable.
Furthermore, a deep network for a hard problem can be seen as
an insufficiently biased learner with an overly large hypothesis
space to search, and is therefore prone to overfitting. The
few transfer settings we investigate introduce good bias to
enable the learner to acquire a concept with greater
generality. Experimental results in Section 4.2 validate the above
analysis.

4. Experiments 

We use the BSDS500 database [1] as our base training
set. Specifically, its disjoint training set (200 images) and
test set (200 images) are all used for training, and its
validation set (100 images) is used for validation. As in other
compression artifacts reduction methods (e.g. RTF [11]),
we apply the standard JPEG compression scheme, and use
the JPEG quality settings q = 20 (mid quality) and q =
10 (low quality) in MATLAB JPEG encoder. We only focus
on the restoration of the luminance channel (in YCrCb
space) in this paper.

Table 1. The average results of PSNR (dB), SSIM, PSNR-B (dB)
on the LIVE1 dataset.

Metric    Quality   JPEG     SA-DCT   AR-CNN
PSNR      10        27.77    28.65    28.98
          20        30.07    30.81    31.29
SSIM      10        0.7905   0.8093   0.8217
          20        0.8683   0.8781   0.8871
PSNR-B    10        25.33    28.01    28.70
          20        27.57    29.82    30.76

The training image pairs {Y, X} are prepared as follows.
Images in the training set are decomposed into 32 x 32
sub-images⁴ $X = \{X_i\}_{i=1}^{n}$. Then the compressed samples
$Y = \{Y_i\}_{i=1}^{n}$ are generated from the training samples
with MATLAB JPEG encoder [11]. The sub-images are
extracted from the ground truth images with a stride of 10.
Thus the 400 training images could provide 537,600 training
samples. To avoid the border effects caused by convolution,
AR-CNN produces a 20 x 20 output given a 32 x 32
input $Y_i$. Hence, the loss (Eqn. (4)) is computed by
comparing against the center 20 x 20 pixels of the ground truth
sub-image $X_i$. In the training phase, we follow [10, 4] and
use a smaller learning rate ($10^{-5}$) in the last layer and a
comparably larger one ($10^{-4}$) in the remaining layers.
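The layer-dependent learning rates mentioned above can be sketched as a single update step. This is a minimal illustration and not the training code used for the experiments: it assumes plain stochastic gradient descent without momentum or weight decay (neither is specified in the text), represents each layer as a flat list of weights, and uses hypothetical layer names and gradients.

```python
def sgd_step(params, grads, layer_order, lr_last=1e-5, lr_other=1e-4):
    """One SGD update with a smaller learning rate in the last layer."""
    last = layer_order[-1]
    updated = {}
    for name in layer_order:
        lr = lr_last if name == last else lr_other
        updated[name] = [w - lr * g for w, g in zip(params[name], grads[name])]
    return updated


order = ["conv1", "conv1p", "conv2", "conv3"]
params = {name: [1.0] for name in order}
grads = {name: [1.0] for name in order}   # made-up gradients
params = sgd_step(params, grads, order)
```

With equal gradients, the reconstruction layer moves ten times less than the other layers in each step.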

4.1. Comparison with the State-of-the-Arts 

We use the LIVE1 dataset [21] (29 images) as test set to
evaluate both the quantitative and qualitative performance.
The LIVE1 dataset contains images with diverse properties.
It is widely used in image quality assessment [25]
as well as in super-resolution [28]. To have a comprehensive
quantitative evaluation, we apply the PSNR, structural
similarity (SSIM) [25]⁵, and PSNR-B [31] for quality
assessment. We want to emphasize the use of PSNR-B. It
is designed specifically to assess blocky and deblocked
images, and is thus more sensitive to blocking artifacts than the
perceptual-aware SSIM index. The network settings are
$f_1 = 9$, $f_{1'} = 7$, $f_2 = 1$, $f_3 = 5$, $n_1 = 64$, $n_{1'} = 32$,
$n_2 = 16$ and $n_3 = 1$, denoted as AR-CNN (9-7-1-5) or simply
AR-CNN. A specific network is trained for each JPEG
quality. Parameters are randomly initialized from a Gaussian
distribution with a standard deviation of 0.001.
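To make the network settings concrete, the number of filter weights implied by them can be computed from the filter shape $n_{i-1} \times f_i \times f_i$ given in Section 3.1. The sketch below is a simple arithmetic aid; the function name is hypothetical, and a single input channel is assumed because only the luminance channel is restored.

```python
def count_weights(filter_sizes, filter_counts, in_channels=1):
    """Sum n_{i-1} * f_i * f_i * n_i over all layers (biases excluded)."""
    total = 0
    previous = in_channels
    for f, n in zip(filter_sizes, filter_counts):
        total += previous * f * f * n
        previous = n
    return total


ar_cnn_weights = count_weights([9, 7, 1, 5], [64, 32, 16, 1])  # 106448
ar_cnn_biases = 64 + 32 + 16 + 1                               # 113
srcnn_weights = count_weights([9, 1, 5], [64, 32, 1])          # 8032
```

Most of the AR-CNN weights (64 x 7 x 7 x 32 = 100,352) sit in the feature enhancement layer, which is the layer added on top of the SRCNN structure.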

4.1.1 Comparison with SA-DCT 

We first compare AR-CNN with SA-DCT [5], which is
widely regarded as the state-of-the-art deblocking oriented
method [11, 14]. The quantitative results of PSNR, SSIM
and PSNR-B are shown in Table 1. On the whole, our AR-CNN
outperforms the SA-DCT on all JPEG qualities and
evaluation metrics by a large margin. Note that the gains
on PSNR-B are much larger than those on PSNR. This
indicates that AR-CNN produces images with fewer blocking
artifacts. To compare the visual quality, we present
some restored images⁶ with q = 10 in Figure 10. From
Figure 10, we can see that AR-CNN produces
much sharper edges with much less blocking and
ringing artifacts compared with SA-DCT. The visual quality
has been largely improved in all aspects compared with
the state-of-the-art method. Furthermore, AR-CNN is superior
to SA-DCT in implementation speed. SA-DCT
needs 3.4 seconds to process a 256 x 256 image, while
AR-CNN takes only 0.5 second. Both are implemented
in C++ on a PC with an Intel I3 CPU (3.1GHz) and 16GB
RAM.

⁴ We use sub-images because we regard each sample as an image rather
than a big patch.

⁵ We use the unweighted structural similarity defined over fixed 8x8
windows as in [26].

Table 2. The average results of PSNR (dB), SSIM, PSNR-B (dB)
on the LIVE1 dataset with q = 10.

Metric    JPEG     SRCNN    Deeper SRCNN   AR-CNN
PSNR      27.77    28.91    28.92          28.98
SSIM      0.7905   0.8175   0.8189         0.8217
PSNR-B    25.33    28.52    28.46          28.70

Figure 5. Comparisons with SRCNN and Deeper SRCNN.

4.1.2 Comparison with SRCNN 

As discussed in Section 3.2, SRCNN is not suitable for
compression artifacts reduction. For comparison, we train
two SRCNN networks with different settings. (i) The original
SRCNN (9-1-5) with $f_1 = 9$, $f_3 = 5$, $n_1 = 64$ and
$n_2 = 32$. (ii) Deeper SRCNN (9-1-1-5) with an additional
non-linear mapping layer ($f_{2'} = 1$, $n_{2'} = 16$). They all use
the BSDS500 dataset for training and validation as in
Section 4. The compression quality is q = 10. The AR-CNN is
the same as in Section 4.1.1.

Quantitative results tested on the LIVE1 dataset are shown
in Table 2. We can see that the two SRCNN networks
are inferior on all evaluation metrics. From the convergence
curves shown in Figure 5, it is clear that AR-CNN achieves
higher PSNR from the beginning of the learning stage.
Furthermore, from their restored images⁶ in Figure 11, we find
that the two SRCNN networks both produce images with
noisy edges and unnaturally smooth regions. These results
support our statements in Section 3.2. In short, the
success of training a deep model needs comprehensive
understanding of the problem and careful design of the model
structure.

⁶ More qualitative results are provided in the supplementary file.

Table 3. The average results of PSNR (dB), SSIM, PSNR-B (dB)
on the test set of the BSDS500 dataset.

Metric    Quality   JPEG     RTF      RTF+SA-DCT   AR-CNN
PSNR      10        26.62    27.66    27.71        27.71
          20        28.80    29.84    29.87        29.87
SSIM      10        0.7904   0.8177   0.8186       0.8192
          20        0.8690   0.8864   0.8871       0.8857
PSNR-B    10        23.54    26.93    26.99        27.04
          20        25.62    28.80    28.80        29.02

4.1.3 Comparison with RTF 

RTF [11] is the recent state-of-the-art restoration oriented
method. Without their deblocking code, we can only compare
with the released deblocking results. Their model is
trained on the training set (200 images) of the BSDS500
dataset, but all images are down-scaled by a factor of
0.5 [11]. To have a fair comparison, we also train new AR-CNN
networks on the same half-sized 200 images. Testing
is performed on the test set of the BSDS500 dataset
(images scaled by a factor of 0.5), which is also consistent
with [11]. We compare with two RTF variants. One is the
plain RTF, which uses the filter bank and is optimized for
PSNR. The other is the RTF+SA-DCT, which includes the
SA-DCT as a base method and is optimized for MAE. The
latter achieves the highest PSNR value among all RTF
variants [11].

As shown in Table 3, we outperform the plain RTF, and
obtain comparable or better performance than the
combination of RTF and SA-DCT, especially under the
more representative PSNR-B metric. Moreover, training on
such a small dataset has largely restricted the ability of AR-CNN.
The performance of AR-CNN is expected to improve further
given more training images.

4.2. Experiments on Easy-Hard Transfer 

We show the experimental results of different “easy-hard
transfer” settings, of which the details are shown in Table 4.
Taking the base network as an example, the base-q10 is a
four-layer AR-CNN (9-7-1-5) trained on the BSDS500 [1]
dataset (400 images) under the compression quality q =
10. Parameters are initialized by randomly drawing from
a Gaussian distribution with zero mean and standard
deviation 0.001. Figures 6-8 show the convergence curves on
the validation set.

4.2.1 Transfer shallow to deeper model 

In Table 4, we denote a deeper (five-layer) AR-CNN as
“9-7-3-1-5”, which contains another feature enhancement layer
($f_{1''} = 3$ and $n_{1''} = 16$). Results in Figure 6 show that the
transferred features from a four-layer network enable us to
train a five-layer network successfully. Note that directly
training a five-layer network using conventional initialization
ways is unreliable. Specifically, we have exhaustively
tried different groups of learning rates, but still have not
observed convergence. Furthermore, the “transfer deeper”
converges faster and achieves better performance than using
He et al.’s method [9], which is also very effective in training
a deep model. We have also conducted comparative
experiments with the structure “9-7-1-1-5” and observed the
same trend.

Table 4. Experimental settings of “easy-hard transfer”.

transfer          short               network     training    initialization
strategy          form                structure   dataset     strategy
base network      base-q10            9-7-1-5     BSDS-q10    Gaussian (0, 0.001)
                  base-q20            9-7-1-5     BSDS-q20    Gaussian (0, 0.001)
shallow to deep   base-q10            9-7-1-5     BSDS-q10    Gaussian (0, 0.001)
                  transfer deeper     9-7-3-1-5   BSDS-q10    1,2 layers of base-q10
                  He [9]              9-7-3-1-5   BSDS-q10    He et al. [9]
high to low       base-q10            9-7-1-5     BSDS-q10    Gaussian (0, 0.001)
                  transfer 1 layer    9-7-1-5     BSDS-q10    1 layer of base-q20
                  transfer 2 layers   9-7-1-5     BSDS-q10    1,2 layers of base-q20
standard to real  base-Twitter        9-7-1-5     Twitter     Gaussian (0, 0.001)
                  transfer q10        9-7-1-5     Twitter     1 layer of base-q10
                  transfer q20        9-7-1-5     Twitter     1 layer of base-q20

Figure 6. Transfer shallow to deeper model.

Figure 7. Transfer high to low quality.

Figure 8. Transfer standard to real use case.

4.2.2 Transfer high to low quality 

Results are shown in Figure 7. The two networks
with transferred features converge faster than the one trained
from scratch. For example, to reach an average PSNR
of 27.77dB, the “transfer 1 layer” takes only $1.54 \times 10^8$
backprops, which is roughly half of that for “base-q10”.
Moreover, the “transfer 1 layer” also outperforms the
“base-q10” by a slight margin throughout the training phase. One
reason for this is that only initializing the first layer
provides the network with more flexibility in adapting to a new
dataset. This also indicates that a good starting point could
help train a better network with higher convergence speed.

4.2.3 Transfer standard to real use case - Twitter 

Online social media such as Twitter are popular platforms for
message posting. However, Twitter compresses uploaded
images on the server side. For instance, a typical
8 megapixel (MP) image (3264 × 2448) results in a
compressed and re-scaled version with a fixed resolution
of 600 × 450. Such re-scaling and compression introduce
very complex artifacts, making restoration difficult for
existing deblocking algorithms (e.g. SA-DCT). However,
AR-CNN can fit the new data easily. Further, we want
to show that features learned under standard compression
schemes could also facilitate training on a completely
different dataset. We use 40 photos of resolution 3264 × 2448
taken by mobile phones (335,209 training subimages in total)
and their Twitter-compressed versions⁷ to train three
networks with the initialization settings listed in Table 4.
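
To give a sense of how aggressive the re-scaling alone is, the following sketch works out the downscaling factor implied by the two resolutions quoted above. It uses only those two resolutions; the variable names are illustrative.

```python
from fractions import Fraction

src_w, src_h = 3264, 2448  # uploaded 8 MP photo
dst_w, dst_h = 600, 450    # re-scaled version served by Twitter

# Per-axis downscaling factors, kept exact with rational arithmetic.
scale_w = Fraction(src_w, dst_w)
scale_h = Fraction(src_h, dst_h)

# Ratio of pixel counts before and after re-scaling.
pixel_ratio = Fraction(src_w * src_h, dst_w * dst_h)

print(float(scale_w), float(scale_h), float(pixel_ratio))
```

Both axes shrink by the same factor of 5.44, so the aspect ratio is preserved and the re-scaled image keeps only about one thirtieth of the original pixels before any compression loss is added.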

From Figure 8, we observe that the “transfer q10”
and “transfer q20” networks converge much faster than
the “base-Twitter” network trained from scratch. Specifically,
“transfer q10” takes 6 × 10^7 backprops to achieve 25.1 dB,
while “base-Twitter” uses 10 × 10^7 backprops. Besides
faster convergence, transferred features also lead to higher
PSNR values compared with “base-Twitter”. This observation
suggests that features learned under standard compression
schemes are also transferable to real use case
problems. Some restoration results⁶ are shown in Figure 12.
We can see that both networks achieve satisfactory quality
improvements over the compressed version.
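
For reference, the PSNR values quoted in this section can be reproduced with the standard definition, sketched below. The sketch assumes 8-bit images with a peak value of 255 and arrays of identical shape; whether the evaluation is carried out on the luminance channel only is not specified here and is left to the caller.

```python
import numpy as np


def psnr(reference, restored, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    reference = np.asarray(reference, dtype=np.float64)
    restored = np.asarray(restored, dtype=np.float64)
    if reference.shape != restored.shape:
        raise ValueError("images must have the same shape")
    mse = np.mean((reference - restored) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)


# Tiny check: a constant error of one gray level gives MSE = 1.
clean = np.zeros((8, 8))
noisy = np.ones((8, 8))
value = psnr(clean, noisy)
```

With an MSE of 1, the result equals 20 log10(255), or about 48.13 dB.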

5. Application 

In real applications, many image processing routines
are affected when they take JPEG images as input. Blocking
artifacts could be either super-resolved or enhanced, causing
a significant performance decrease. In this section, we
show the potential of AR-CNN in facilitating other low-level
vision studies, i.e. super-resolution and contrast enhancement.
To illustrate this, we use SRCNN [4] for super-resolution
and tone-curve adjustment [14] for contrast enhancement [2],
and show example results when the input is
a JPEG image, an SA-DCT deblocked image, and an AR-CNN
restored image. From the results shown in Figure 9, we can
see that JPEG compression artifacts have greatly degraded
the visual quality in super-resolution and contrast enhancement.
Nevertheless, with the help of AR-CNN, these effects

7 We will share this dataset on our project page. 

Panels (PSNR / SSIM / PSNR-B):
Original
JPEG: 32.46 dB / 0.8558 / 29.64 dB
SA-DCT: 33.88 dB / 0.9015 / 33.02 dB
AR-CNN: 34.37 dB / 0.9079 / 34.10 dB

Figure 10. Results on image “parrots” show that AR-CNN is better than SA-DCT on removing blocking artifacts.



Panels (PSNR / SSIM / PSNR-B):
JPEG: 30.12 dB / 0.8817 / 26.86 dB
SRCNN: 32.58 dB / 0.9298 / 31.52 dB
Deeper SRCNN: 32.60 dB / 0.9301 / 31.47 dB
AR-CNN: 32.88 dB / 0.9343 / 32.22 dB

Figure 11. Results on image “monarch” show that AR-CNN is better than SRCNN on removing ringing effects.



Panels (PSNR):
Original
Twitter: 26.55 dB
Base-Twitter: 27.75 dB
Transfer q10: 27.92 dB

Figure 12. Restoration results of AR-CNN on Twitter-compressed images. The original image (8MP version) is too large for display and
only part of the image is shown for better visualization.



(b) Contrast enhancement output 

Figure 9. AR-CNN can be applied as pre-processing to facilitate 
other low-level routines when they take JPEG images as input. 


have been largely eliminated. Moreover, AR-CNN achieves
much better results than SA-DCT. The differences between
them are more evident after such low-level vision
processing.

6. Conclusion

Applying deep models to low-level vision problems requires
a deep understanding of the problem itself. In this paper,
we carefully study the compression process and propose
a four-layer convolutional network, AR-CNN, which
is extremely effective in dealing with various compression
artifacts. We further systematically investigate several
easy-to-hard transfer settings that could facilitate training
a deeper or better network, and verify the effectiveness of
transfer learning in low-level vision problems. As discussed
in SRCNN [4], we find that larger filter sizes also help
improve the performance. We leave this to future work.
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